Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style - it is 100% hands on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill out with R code along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into R when required. This learning approach will allow you to spend the time coding and not taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.
We'll take a blank-slate approach here to R and assume that you know pretty much nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
You have a pile of data (like an Excel file or tab-separated file) full of experimental observations and you don't know what to do with it.
Maybe you're manipulating large tables all in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about R and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy our data (data wrangling), subset and merge data, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which really can save you time and help scale up your analyses.
The structure of the class is a code-along style: it is fully hands on. At the end of each lecture, the complete notes will be made available in PDF format through the corresponding Quercus module so you don't have to spend your attention on taking notes.
This is the fifth in a series of seven lectures. Last lecture we learned the basics of data visualization with the ggplot2 package. This week we return to the world of data cleaning with a very important tool - the regular expression! At the end of this session you will be able to use tidyverse tools and regular expressions to tidy/clean your data. This week our topics are broken into:
stringr package.
Grey background: Command-line code, R library and function names... fill in the code here if you are coding along.
Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2021-09-IntroR/Lecture_XX, which contains a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.
Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus. A recorded version of the lecture will be made available through the University's MyMedia website and a link will be posted in the Discussion section of Quercus.
Today we have 2 data files to help us work through the concepts of data cleaning with regular expressions.
This is an example file for us to start playing with the idea of regular expressions.
This is the main file that we'll be working with for the rest of the lecture. We'll search, replace, and manipulate data from this file after importing it into our notebooks.
The following packages are used in this lesson:
tidyverse (tidyverse installs several packages for you, like dplyr, readr, readxl, tibble, and ggplot2). In particular, we will be taking advantage of the stringr package this week.
Some of these packages should already be installed into your Anaconda base from previous lectures. If not, please review that lesson and install these packages. Remember to install these packages from the conda-forge channel of Anaconda.
#--------- Install packages for today's session ----------#
# install.packages("tidyverse", dependencies = TRUE) # This package should already be installed on JupyterHub
#--------- Load packages for today's session ----------#
library(tidyverse)
Warning message: "package 'tidyverse' was built under R version 4.0.5"
-- Attaching packages --------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.3     v purrr   0.3.4
v tibble  3.1.1     v dplyr   1.0.6
v tidyr   1.1.3     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.1
Warning message: "package 'ggplot2' was built under R version 4.0.5"
Warning message: "package 'tibble' was built under R version 4.0.5"
Warning message: "package 'tidyr' was built under R version 4.0.5"
Warning message: "package 'dplyr' was built under R version 4.0.5"
Warning message: "package 'forcats' was built under R version 4.0.5"
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
In previous weeks the data cleaning we've worked with has been more in the realm of data management - moving columns, converting from wide to long format, and making new variables. We've been more focused on getting the data into a proper format for analysis. Aside from splitting multi-variable columns apart, however, we have done very little to alter the raw data values and headings themselves.
Why do we need to do this?
'Raw' data is seldom (read: never) in a usable format. Data in tutorials or demos has already been meticulously filtered, transformed, and readied to showcase that specific analysis. How many people have done a tutorial only to find they can't get their own data into the right format to use the tool they just spent an hour learning about?
Data cleaning requires us to:
Some definitions might take this a bit further and include normalizing data and removing outliers. In this course, we consider data cleaning to be getting data into a format where we can start actively exploring it with graphics, data normalization, etc.
Today we are going to mostly focus on the data cleaning of text. This step is crucial for taking control of your dataset and your metadata. I have included the functions I find most useful for these tasks but I encourage you to take a look at the Strings Chapter in R for Data Science for an exhaustive list of functions. We have learned how to transform data into a tidy format in lectures 2 and 3, but the prelude to transforming data is doing the grunt work of data cleaning. So let's get to it!
"A God-awful and powerful language for expressing patterns to match in text or for search-and-replace. Frequently described as 'write only', because regular expressions are easier to write than to read/understand. And they are not particularly easy to write." - Jenny Bryan
RegEx is a very sophisticated way to find, replace, and extract information from strings.
For our first regex exercise, use Microsoft Word to open the file "regex_word.docx". This file contains one string: "Bob and Bobby went to Runnymede Road for a run and then went apple bobbing.". Here is what we are going to do:
So why do regular expressions or 'RegEx' get so much flak if they are so powerful for text matching? For a scary example, see how to match an email address in different programming languages: http://emailregex.com/.
Writing/reading RegEx is definitely one of those situations where you should annotate your code. There are many terrifying urban legends about people coming back to their code and having no idea what it means.
There are sites available to help you make up your regular expressions and validate them against text. These are usually not R specific, but they will get you close and the expression will only need a slight modification for R (like an extra backslash - described below).
RegEx testers:
https://regex101.com/
https://regexr.com/
Today we will be practicing RegEx at Simon Goring's R-specific demo tester:
https://simongoring.shinyapps.io/RegularExpressionR/
What does the language look like?
The language is based on meta-characters which have a special meaning rather than their literal meaning. For example, '\$' is used to match the end of a string, and this use supersedes its use as a character in a string (i.e. 'Joe paid \$2.99 for chips.').
What kind of character is it? Classes identify specifically reserved groups of characters and are used as a quick way to represent them. There are two ways to define classes - either using special "escaped" characters denoted with a \ or using the [ ] notation within a regular expression.
| Expression | Meaning |
|---|---|
| \w, [[:alnum:]] | word characters (letters + digits; \w also matches underscore) |
| \d, [0-9], [[:digit:]] | digits |
| [A-Za-z], [[:alpha:]] | alphabetical characters |
| \s, [[:space:]] | space |
| [[:punct:]] | punctuation |
| [[:lower:]] | lowercase letters |
| [[:upper:]] | uppercase letters |
| \W, [^A-Za-z0-9_] | not word characters |
| \S | not space |
| \D, [^0-9] | not digits |
Note that some of these are not universal but rather specific to POSIX bracket expressions, i.e. [:xx:], and must be used within a second set of brackets (e.g. [[:digit:]]) to be valid. This kind of syntax is compatible with many regex systems, including R and Unix tools.
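To see these classes in action, here is a minimal sketch using stringr (which we load later in this lesson) on a made-up string:

```r
library(stringr)

x <- "Sample A12, pH 7.4"   # made-up example string

# The escaped class and the POSIX class capture the same digits
str_extract_all(x, "\\d")          # "1" "2" "7" "4"
str_extract_all(x, "[[:digit:]]")  # identical result

# Alphabetical characters only - no digits, spaces, or punctuation
str_extract_all(x, "[[:alpha:]]")
```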
When we are thinking about patterns, we may expect the repetition of patterns within patterns. To account for this as a language, regular expressions use quantifiers to denote the presence or absence of specific patterns. In other words, we use quantifiers to answer "How many times will a character appear?"
| Expression | Meaning |
|---|---|
| ? | 0 or 1 occurrence |
| * | 0 or more occurrences |
| + | 1 or more occurrences |
| {n} | exactly n occurrences |
| {n,} | at least n occurrences |
| {0,n} | at most n occurrences (some flavors also accept {,n}) |
| {n,m} | between n and m occurrences (inclusive) |
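To make the quantifiers concrete, here is a minimal sketch using stringr on a made-up string of short DNA-like words:

```r
library(stringr)

seqs <- "AATG AAATG ATG TG"   # made-up example string

str_extract_all(seqs, "A+TG")    # one or more A: "AATG" "AAATG" "ATG"
str_extract_all(seqs, "A{2}TG")  # exactly two A: "AATG" twice (once inside "AAATG")
str_extract_all(seqs, "A?TG")    # zero or one A: also matches the bare "TG"
```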
You may recall from lecture 3 that we encountered the idea of conditional operators. Regular expressions can also rely on the | OR operator to denote that a subpattern or its alternative must be present. Additional operators are used to classify character groups. Much like with classes, we can define specific groups of characters to look for within our patterns. Overall, the operators can be thought of as helper actions to match your characters.
| Expression | Meaning |
|---|---|
| | | or |
| . | matches any single character |
| [ ... ] | matches ANY of the characters inside the brackets |
| [ ... - ... ] | matches a RANGE of characters inside the brackets |
| [ ^... ] | matches any character EXCEPT those inside the bracket |
| ( ... ) | grouping - used for [backreferencing] (https://www.regular-expressions.info/backref.html) |
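Here is a minimal sketch of these operators, using stringr on a made-up string:

```r
library(stringr)

dna <- "GATTACA"   # made-up example string

str_extract_all(dna, "A|T")    # OR: every A or T
str_extract_all(dna, "[AT]")   # equivalent with a character set
str_extract_all(dna, "[A-C]")  # range: any of A, B, or C
str_remove_all(dna, "[^GC]")   # negation: drop everything EXCEPT G and C -> "GC"
str_extract(dna, "(GA)(TT)")   # grouping: matches "GATT" as two groups
```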
Recall we are working with strings - these can be long or short but they all have a start and an end. Sometimes the patterns we are concerned with are located at the beginning or end of a string (think prefixes and suffixes). We use special position-matching characters to denote these ideas.
| Expression | Meaning |
|---|---|
| ^ | start of the string |
| $ | end of the string |
| \\b | empty string at either edge of the word |
| \\B | empty string that is NOT at the edge of a word |
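A minimal sketch of the anchors, using stringr on made-up sample IDs (note the doubled backslashes needed inside R strings, explained in the next section):

```r
library(stringr)

ids <- c("sample-01", "01-sample", "resampled")  # made-up IDs

str_detect(ids, "^sample")       # TRUE FALSE FALSE: starts with "sample"
str_detect(ids, "01$")           # TRUE FALSE FALSE: ends with "01"
str_detect(ids, "\\bsample\\b")  # TRUE TRUE  FALSE: "sample" as a whole word
```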
\¶This can be one of the hardest concepts in regular expressions, but remember that we just went over multiple lists where characters had their own meanings, like $ or (. These are called metacharacters because they have meaning beyond just representing string characters: to a regular expression they initiate a specific kind of interpretation of the characters around them.
Escape sequences, written with a backslash \, allow you to use a character 'as is' rather than with its special RegEx function. In R, patterns are evaluated as strings first, then as regular expressions. A backslash is used to escape each RegEx metacharacter in the string, so we will often need 2 backslashes. For instance, if we want to use the string $100 in a RegEx search term, we need to tell R that we just want the literal character $ and not to look for the pattern 100 at the end of a string.
If we used \$100 we wouldn't get what we want, because the \ is also a metacharacter that R interprets first! Instead we must use \\$100: the first backslash escapes the second, so at the end we end up with a pattern written as \\$ in R. The RegEx engine, however, only ever sees \$.
| Expression | Meaning |
|---|---|
| \\ | escape for meta-characters to be used as characters (*, $, ^, ., ?, |, \, [, ], {, }, (, )). |
Note: the backslash itself is also a meta-character.
Trouble-shooting with escaping meta-characters means adding backslashes until something works.
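A minimal sketch of the $100 example above, using str_detect() and str_extract() from stringr (introduced below) on a made-up string:

```r
library(stringr)

prices <- "Chips cost $100"   # made-up example string

# "$" means end-of-string here, so the pattern "$100" can never match
str_detect(prices, "$100")     # FALSE

# "\\$" reaches the RegEx engine as \$ - a literal dollar sign
str_detect(prices, "\\$100")   # TRUE
str_extract(prices, "\\$100")  # "$100"
```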

While you can always refer back to this lesson for making your regular expressions, you can also use this RegEx cheatsheet.
Now that we have some basic ideas and terms under our belts, let's get to working with actual regular expressions. We will kick off this RegEx lecture by playing around with an online tool: https://regexr.com/.
In the text section of the online tool, replace the default text by:
>NP_001009020.1 forkhead box protein P2 [Pan troglodytes] MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQT SGLKSPKSSDKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYK
We are going to explore the following expression by breaking it down into its components:
^>(\w+)\.(\d)\s(.+)\[(\w+)\s(\w+).*\]
What do you think it matches in our text?
| Expression | Meaning |
|---|---|
| ^ | beginning of a line |
| > | a literal greater-than symbol, used to designate the beginning of a new record in FASTA format |
| \w | match a word character (letters, digits, underscore); remember that in an R string you would write \\w |
| + | match the preceding item one or more times |
| \. | match a literal period (the backslash escapes the . metacharacter) |
| \d | match a digit |
| \s | match a space |
| .+ | anything, one or more times |
| \[ | match a literal square bracket |
| .* | anything, zero or more times |
stringr¶Common uses of string manipulation are: Searching, replacing or removing (making substitutions), and splitting and combining substrings.
Base R offers a variety of built-in RegEx functions such as grep(), grepl(), and gsub() that are common to other programming languages. Even though these functions are computationally very efficient, it can be challenging to master their use. Tidyverse's stringr package offers a comprehensive, user-friendly set of RegEx-compatible functions with a consistent interface that is easier to work with for those unfamiliar with regular expressions. Thus, we will stick to stringr to get some hands-on RegEx experience.
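To see the difference in ergonomics, here is a minimal sketch of the same search and replacement written both ways, on a made-up vector:

```r
library(stringr)

fruits <- c("apple", "banana", "cherry")   # made-up example vector

# Base R: pattern comes first, string second
grepl("an", fruits)       # FALSE TRUE FALSE
gsub("an", "AN", fruits)  # "apple" "bANANa" "cherry"

# stringr: string comes first, pattern second - consistent across every str_* function
str_detect(fruits, "an")
str_replace_all(fruits, "an", "AN")
```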
As an example, we are going to play with a string of DNA.
# Assign our sequence to dino
dino <- ">DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACTCCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGGCCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT"
This piece of DNA is from the book Jurassic park, and was supposed to be dinosaur DNA, but is actually just a cloning vector. Bummer.
RStudio and Jupyter notebooks have their own RegEx-compatible find-and-replace functionality in their graphical user interfaces (GUIs). In Jupyter you can find this in the menu under Edit > Find and Replace. As a quick example, let's find out where in this notebook we can find the string GCGTTGCTGGCG. You can also check other boxes if your search is case sensitive or if you are looking for whole words.
str_remove() and str_remove_all()¶Our string "dino" is in FASTA format, but we don't need the header; we just want to deal with the DNA sequence. The header begins with '>' and ends with a number, '1200', with a space between the header and the sequence. Let's practice capturing each of these parts of a string, and then we'll make a regular expression to remove the entire header.
All stringr functions take in as arguments the string you are manipulating and the pattern you are capturing. str_remove replaces the matched pattern with an empty character string "". In our first search we remove '>' from our string, dino.
# Remove the pattern that matches a ">"
str_remove(string = dino, pattern = ">")
Next we can search for numbers. The expression '[0-9]' looks for any single digit. Always make sure to check that the pattern you are using gives you the output you expect.
# Remove the first instance of an integer 0-9
str_remove(string = dino, pattern = "[0-9]")
str_remove() replaces the first instance of a search string versus str_remove_all()¶Why aren't all of the numbers replaced? str_remove only replaces the first match in a character string. Switching to str_remove_all replaces all instances of matches in the character string.
# Remove all instances of integers 0-9
str_remove_all(string = dino, pattern = "[0-9]")
How do we capture spaces? The pattern \s matches a space. However, so that the backslash is not consumed as an R string escape character (its special function), we need to add another backslash, making our pattern \\s. In other words, you need to escape the backslash itself with another backslash.
# Remove all spaces from dino
str_remove_all(string = dino, pattern = "\\s")
To remove the entire header, we need to combine these patterns. Remember, we are really just converting the header string into a simple description - what are the base components of the header and what pattern do they follow?
The header is everything in between > and the number 1200, followed by a space. The operator `.` captures any single character and the quantifier `*` matches it any number of times (including zero).
# Without a quantifier, ">." matches ">" plus one character
str_remove(string = dino, pattern = ">.")
# removes > and D, the first two characters in the string
# Still no quantifier: ">.[0-9]\\s" finds no match here (no digit two characters after ">"), so nothing is removed
str_remove(string = dino, pattern = ">.[0-9]\\s")
# Add the * quantifier to remove the entire header
str_remove(string = dino, pattern = ">.*[0-9]\\s") # remove entire header
>DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 GCGTTGCTGGCGTTTTTCCATAGGCTCCG...
versus
>DinoDNA from Crichton JURASSIC PARK p. 103
You may have noticed that we also have a number followed by a space earlier in the header: '103 '. Why didn't the replacement end at that first match? This is an example of greedy matching - by default, * captures the longest possible string.
To curtail this behavior and use lazy matching - the shortest possible string - you can add the ? quantifier. Remember that this metacharacter can already signify looking for 0 or 1 occurrences of a pattern.
In this case, we are going to use it to make the preceding quantifier * lazy by causing it to match as few characters as possible while still matching the remainder of the pattern.
# What happens if we modify our * with lazy matching?
str_remove(string = dino, pattern = ">.*?[0-9]\\s")
# lazy matching removes the first occurrence of a group of digits followed by a space
The concepts of greediness and laziness are also worth keeping in mind when choosing among stringr's seemingly redundant function pairs such as str_extract() and str_extract_all(), or str_view() and str_view_all(): the _all variants act on every match rather than just the first.
In this case, we want the greedy matching to replace the entire header. Let's save the headerless dna sequence into its own object.
# Save DNA sequence using the correct header Regex
dna <- str_remove(string = dino, pattern = ">.*[0-9]\\s")
dna
r"( )" to simplify your expressions¶Now that we've walked through the complex process of separating our FASTA header from our DNA sequence, we can appreciate some of the finer nuances of using regular expressions. The biggest hurdle is always escaping. For instance, in section 3.2.2.2 we emphasized the importance of matching the blank space in our string using \\s - even though we know the metacharacter for a space is \s. Of course, things can get quite complicated if our source string itself contains backslashes! Take a look at the following example code:
bad_string = "you\\stay\\here"
# How the string is stored
bad_string
# How it is interpreted
cat(bad_string)
you\stay\here
If we wanted to search through that string for a substring that includes \\s, what is our regular expression? Let's try to find u\\s to convince ourselves that we're getting the right part of the substring. We'll use the str_remove() function to help us out.
# Remove the u\\s from our string
str_remove(bad_string, pattern = "u\\\\s")
As of R version 4.0.0, the ability to produce raw string literals has been added. What does that mean? It means that instead of having to add extra \ escape characters for our escape characters, we can write the regular expression as intended, searching for the exact pattern we want.
We initiate a raw string literal with the syntax r"(your_regex_here)". That means we need to enclose our regex pattern within both a set of double quotes, and parentheses "( )" while also including the prefix r. While not a perfect solution as we'll see later in section 4.2.5 when working with capture groups, this does offer some simplification when working with RegEx in R.
Let's see how we would code the above example and our original dino sequence example
# Use a raw string literal to remove u\\s from our bad_string
str_remove(bad_string, pattern = r"(u\\s)")
# Our original code for removing spaces in our header
str_remove_all(string = dino, pattern = "\\s")
# Convert it to a raw string literal. Will this work?
str_remove_all(string = dino, pattern = r"(\\s)")
# Use the proper raw string literal
str_remove_all(string = dino, pattern = r"(\s)")
str_extract()¶We may also want to retain our header in a separate string. str_extract() will retain the string that matches our pattern instead of removing it. We can save this in an object called header. Note that we have removed the final space \\s from our expression because it's more of a separator between header and sequence.
# Save the header information to a variable
header <- str_extract(string = dino, pattern = r"(>.*[0-9])")
# Check that it worked as expected
header
str_extract_all()¶Now we can look for patterns in our (dino) DNA!
Does this DNA have balanced GC content? We can use str_extract_all to capture every character that is either a G or a C.
# Use extract all to find any Gs or Cs
str_extract_all(dna, pattern = "G|C")
# This is equivalent when using str_extract_all
str_extract_all(dna, pattern = "[GC]")
The output is a list object that stores one entry for each G or C extracted. We can count the number of occurrences of G and C using str_count() and divide by the total number of characters in our string (str_length()) to get the %GC content.
# Take advantage of helper functions like str_length()
str_count(dna, pattern = "G|C")/str_length(dna) * 100 # percentage of GC content
str_replace_all()¶Let's translate this into mRNA!
To replace multiple patterns at once, we supply str_replace_all() with a named character vector of patterns and their replacements. This lets us perform several replacements in a single call. If you wanted to replace just the first instance of a match, you would use str_replace() instead.
Both str_replace*() functions include the argument replacement in addition to string and pattern. In this case, however, our named character vector is provided to the pattern argument.
# Use str_replace_all to transcribe our DNA sequence
mrna <- str_replace_all(string = dna,
pattern = c("G" = "X", # placeholder to prevent all C from becoming G
"C" = "G",
"A" = "U",
"T" = "A",
"X" = "C"))
# Look at the resulting mRNA sequence
mrna
How do we query our sequence for the presence of specific patterns or motifs?
str_detect()¶Is there even a start codon in this sequence? str_detect() can be used to return a logical (TRUE or FALSE) vector indicating whether or not a match is found.
# Check for the presence of a start codon
str_detect(string = mrna,
pattern = "AUG")
str_count()¶It might be more useful to know exactly how many possible start codons we have. str_count() will count the number of matches in the string for our pattern.
str_count(mrna, "AUG")
str_locate()¶To get the position of a possible start codon we can use str_locate(), which will return the indices (coordinates) of where the start and end of our FIRST substring occurs. Values are returned in a 2-column matrix.
str_locate_all() can be used to find all possible locations, but it returns a list of 2-column matrices - one matrix for each input string.
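Before filling in the blanks below, here is a minimal sketch of the difference on a short made-up string:

```r
library(stringr)

toy <- "AUGCCAUGG"   # made-up string with "AUG" at positions 1 and 6

# str_locate(): a 2-column matrix holding only the FIRST match
str_locate(toy, "AUG")
#      start end
# [1,]     1   3

# str_locate_all(): a list with one matrix per input string, holding ALL matches
str_locate_all(toy, "AUG")
```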
# try out string locate on mrna
...(c(mrna, mrna), "AUG")
# vs string locate all on mrna
...(mrna, "AUG")
str_sub()¶Sometimes, rather than keeping or discarding the portion of a string that matches a pattern, you may want to subset the string by position.
Let's split our mrna sequence into substrings of codons, starting at the position of our start codon. We have the position of our start codon from str_locate(). We can use str_sub() to subset the string by position (we will just go to the end of the string for now).
str_sub(string, start, end), where:
start defaults to the first character of string
end defaults to the last character of string
# Examine the mRNA sequence
mrna
# Subsetting from position 90 (beginning of start codon) to base 1,200 (last base in our sequence).
# Note that the beginning of the sequence changed
str_sub(mrna, ... = 90, ... = 1200)
# is equivalent to
str_sub(mrna, str_locate(mrna, "AUG")[1]) # Look at the nested code (one function inside another)
# Equivalent code using the piping format
mrna %>%
str_locate(., "AUG") %>% # Locate the first "AUG"
... %>% # Take its location from the first column of the output
str_sub(mrna, .) # Retrieve the mRNA sequence from that point onwards
# Replace the original mRNA sequence with the trimmed version
mrna <- str_sub(mrna, str_locate(mrna, "AUG")[1])
mrna
...¶Remember that . represents any character so we can get codons by extracting groups of (any) 3 nucleotides/characters in our reading frame. Note that these methods search for non-overlapping pattern matches.
# think of these dots as wildcards to pick any character triplets
str_extract_all(mrna, ...)
# how many codons did we retrieve?
length(str_extract_all(mrna, ...))
list as output.¶Why is the length of our str_extract_all() output just 1? The codons are extracted into a list, but we can get our character substrings using unlist().
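A minimal sketch of this list-then-unlist pattern on a made-up six-base string:

```r
library(stringr)

out <- str_extract_all("AUGGCU", "...")  # made-up string; "..." = any 3 characters

length(out)   # 1 - the list has one element per input string
out[[1]]      # "AUG" "GCU" - the codons live inside that element

codon_vec <- unlist(out)  # flatten the list into a plain character vector
length(codon_vec)         # 2
```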
# Unlist our results
codons <-
str_extract_all(mrna, "...") %>% # Make the extraction
... # Unlist the resulting extracted output
# Take a look at the resulting codons
codons
length(codons)
|¶We now have a vector with 370 codons. Do we have a stop codon in our reading frame?
Remember we've seen a couple of ways to search for multiple patterns in a single RegEx call. Let's check with str_detect(). We can use round brackets ( ) in our search pattern to separately group the different stop codons. We'll use the | (OR) metacharacter to search for the occurrence of any of our 3 stop codon sequences.
Notice how in this case we are supplying a character vector codons?
str_detect(codons, "(UAG)|(UGA)...")
str_which()¶Looks like we have many matches. We can subset the codons using str_detect() (instances where the presence of a stop codon is equal to TRUE) to see which stop codons are represented.
Recall from Lecture 01 that we can use the which() function to find which indices the stop codons are positioned at. The call comes in the format of which(data_vector LOGICAL/BOOLEAN desired_value) where:
data_vector is your vector or array
LOGICAL/BOOLEAN is the comparison you're looking for, like ==, >, !=, etc.
desired_value determines which results evaluate to TRUE and are returned as indices of data_vector
So you could combine the two commands which() and str_detect() to produce output identifying the index locations of hits for your search pattern. The stringr package, however, provides a wrapper function that implements that combination for us: str_which()! It simplifies the required syntax and works similarly to str_detect(), requiring just a string and a pattern. You can also negate the result with the negate = TRUE parameter.
# Which returns a numerical vector of indices
...(codons, "UAG|UGA|UAA")
# Note that this is equivalent
which(str_detect(codons, ...)) # str_detect() returns booleans; which() converts them to indices
Let's subset codons to end at the first stop codon.
# Simply grab codons by index position
codons[...]
#equivalent to
translation <- codons[...]
# The search will be stopped at the first detection of either of the stop codons listed
translation
After finding our unique codons, we can translate codons into their respective proteins by using str_replace_all using multiple patterns and replacements as before.
unique(translation)
codon_translation = c("UUU" = "F", "UCU" = "S", "UAU" = "Y", "UGU" = "C",
"UUC" = "F", "UCC" = "S", "UAC" = "Y", "UGC" = "C",
"UUA" = "L", "UCA" = "S", "UAA" = "*", "UGA" = "*",
"UUG" = "L", "UCG" = "S", "UAG" = "*", "UGG" = "W",
"CUU" = "L", "CCU" = "P", "CAU" = "H", "CGU" = "R",
"CUC" = "L", "CCC" = "P", "CAC" = "H", "CGC" = "R",
"CUA" = "L", "CCA" = "P", "CAA" = "Q", "CGA" = "R",
"CUG" = "L", "CCG" = "P", "CAG" = "Q", "CGG" = "R",
"AUU" = "I", "ACU" = "T", "AAU" = "N", "AGU" = "S",
"AUC" = "I", "ACC" = "T", "AAC" = "N", "AGC" = "S",
"AUA" = "I", "ACA" = "T", "AAA" = "K", "AGA" = "R",
"AUG" = "M", "ACG" = "T", "AAG" = "K", "AGG" = "R",
"GUU" = "V", "GCU" = "A", "GAU" = "D", "GGU" = "G",
"GUC" = "V", "GCC" = "A", "GAC" = "D", "GGC" = "G",
"GUA" = "V", "GCA" = "A", "GAA" = "E", "GGA" = "G",
"GUG" = "V", "GCG" = "A", "GAG" = "E", "GGG" = "G")
translation <- str_replace_all(translation, ...)
translation
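As a quick aside, the named-vector trick works for any set of replacements, not just the codon table. A minimal sketch with made-up amino acid abbreviations as toy data:

```r
library(stringr)

# Toy example: str_replace_all() with a named vector applies
# several pattern = replacement pairs in one call
abbrev <- c("Ala" = "A", "Gly" = "G", "Ser" = "S")
str_replace_all(c("Ala", "Gly", "Ser"), abbrev)  # "A" "G" "S"
```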
str_flatten()

What is our final protein string? str_flatten() allows us to collapse our individual protein strings (or any character vector) into one long string.
...(translation)
# Equivalent call using the paste function
paste(translation, collapse="")
translation <- ...
translation
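A tiny self-contained sketch of str_flatten() on toy data, including its optional collapse separator:

```r
library(stringr)

# Toy example: collapse a character vector into one string
str_flatten(c("M", "K", "V"))                  # "MKV"
str_flatten(c("M", "K", "V"), collapse = "-")  # "M-K-V"
```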
str_c()

We can add our header back using str_c(), which allows us to combine strings. We can use a space to separate our original strings with the sep parameter.
# recombine header with our translation and separate with a space
...(header, translation, sep = ...)
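For reference, a minimal sketch of str_c() on toy strings, showing both recycling and the sep parameter:

```r
library(stringr)

# Toy example: str_c() combines strings element-wise, recycling as needed
str_c("chr", 1:3)                       # "chr1" "chr2" "chr3"
str_c("header", "SEQUENCE", sep = " ")  # "header SEQUENCE"
```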
Now that we've walked through some of the basic functions that utilize RegEx, let's try some more advanced examples.
In the next exercise, we convert a multifasta file into a csv file that can be viewed in spreadsheet software such as MS Excel. Here are the instructions:
Luckily for you, the data is already in your folders but if you wanted to get the data yourself, it would go something like this.
Go to NCBI
Search "FoxP2[gene] AND primates[orgn]" (without the quotation marks)
Under "Proteins", Select "Protein"
Under "Source Databases" (left side) Select "RefSeq". This is important since each database has a different fasta header format
Select "200 per page"
Select "FASTA (text)" under the Summary pulldown
Save the multifasta file under Lecture_05_regex/data/ and name it "FoxP2_primate.fasta".
Goal: Reorganize the data from a fasta format into a spreadsheet format with one row per entry (gene) and the following columns:
For example, given the entry
>NP_001009020.1 forkhead box protein P2 [Pan troglodytes]\r\n MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQT\r\n SGLKSPKSSDKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYK\r\n
The expected 5-part output would be
- NP_001009020.1
- forkhead box protein P2
- Pan
- troglodytes
- MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLL...

Once the file is in the desired format, write the file to disk in CSV format.
In MS Excel, it should be organized something like
| Accession_number | Protein_info | Genus | Species | Sequence |
|---|---|---|---|---|
| NP_001009020.1 | forkhead box protein P2 | Pan | troglodytes | MMQESATETISNSSMNQNGMST... |
Let's do it!
readr::read_file()

First, we need to read in the file that we just downloaded. We'll use the read_file() function from the readr package, which is already imported with the tidyverse. The read_file() function reads a complete file into a single character vector - aka one long string.
# Take a quick look at our directory
getwd()
list.files("data/")
# read file as a single string
fasta <- read_file("data/FoxP2_primate.fasta")
# How big is this string?
...
# a single continuous string with all genes.
print(str_sub(...))
# Note the line breaks!
Before we start manipulating our data, let's inspect fasta. You'll notice from above that there are \r\n characters interspersed throughout the string, which are interpreted as line breaks.
We already know it has 104,129 characters but let's double check the properties of the fasta object.
# Retrieve the structure
str(fasta)
# is it a vector?
...(fasta)
# What are its dimensions?
dim(fasta) # Is unidimensional so the output of dim() is NULL
writeLines() or cat() to see your output in a human-readable format

Right now our data is in a character vector, but it's just a single long string. We can't even see it properly with str() or glimpse(). We tried the print() function but it did not account for line breaks.

The writeLines() function, however, will allow us to see the data in a more understandable format. The cat() function will also interpret \n characters, but it additionally concatenates multiple entries from a vector using a sep = parameter.

To save on output space, we'll limit fasta to just the first 2,000 characters as we did with print().
# We'll practice our piping here again at the same time
fasta %>%
# Subset our fasta string
str_sub(., 1, 2000) %>%
# Pass it along to writeLines which will properly interpret the line breaks
...
# We'll practice our piping here again at the same time
fasta %>%
# Subset our fasta string
str_sub(., 1, 2000) %>%
# Another base-r alternative, cat will also properly interpret the line breaks
...
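The difference between print() and these two functions is easiest to see on a toy string containing escape characters (not the fasta data itself):

```r
# Toy example: print() shows escape codes literally,
# while writeLines() and cat() interpret them
s <- "line one\r\nline two"
print(s)       # [1] "line one\r\nline two"
writeLines(s)  # prints "line one" and "line two" on separate lines
cat(s, "\n")   # same interpretation of the line break as writeLines()
```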
Given that we know that > represents the beginning of an entry in fasta format, we can count the number of > as a proxy for total entries. Recall which function we can use to help count?
# Count the number of > in fasta
str_count(string = fasta, pattern = ...)
It looks like we have 130 entries. It's actually 129, but we'll get to that problem eventually. Time to work on our tasks!
Now that we know more about the nature of our fasta information, we can put together a plan for how we want to transform (parse) it into our final format. We see that:

- Each entry begins with a > and its header ends with species information in square brackets, like [binomial name].
- Entries are separated by \r\n\r\n, creating two line breaks.

Let's use the following plan to set our order of operations:

1. Remove the square brackets around the species information (e.g. [Homo sapiens])
2. Split the string into separate entries at the double line breaks
3. Remove the > symbol from each entry

First let's prepare a second variable subfasta which contains a smaller portion of our entries.
# Make a smaller version of our string to play with
subfasta <-
fasta %>%
str_sub(., 1, 2000)
str_replace_all() to remove [ and ]

The best way to go through the entire string is to use str_replace_all().
subfasta %>%
str_replace_all(pattern = ..., # look for opening or closing squared brackets
replacement = ...) %>% # replace squared brackets with nothing
writeLines() # Let's make it easier to look at
Each entry is separated by a few characters. If we had printed the output instead to take a closer look at the end of each entry we would see ...PELEDDREIEEEPLSEDLE\r\n\r\n>NP_683697.2. Contrast this to the characters between lines of amino acid sequence which are just \r\n. Can we take advantage of this difference?
We can use str_split() but what pattern should we give it?
# Use str_split to break up the line breaks
subfasta %>%
str_replace_all(pattern = r"(\[|\])", replacement = "") %>% # Remove square brackets
str_split(...) %>% # Use str_split to break up the line breaks. On unix you can use \\n\\n
head()
str_split() returns a list but we want to work with a vector! Let's quickly fix that.
# You need to convert the result to a vector to work with.
subfasta %>%
str_replace_all(pattern = r"(\[|\])", replacement = "") %>% # Remove square brackets
str_split(r"(\r\n\r\n)") %>% # Use str_split to break up the line breaks. On unix you can use \\n\\n
... # Convert our list to a character vector
Removing the > from the start of each entry

Remember, each entry is now separately stored in a character vector. We want to go through each and remove the > symbol. To accomplish this we will use str_replace_all(), which expects a character vector as input. Perfect!
# remove the ">" from the start of each entry
subfasta %>%
str_replace_all(pattern = r"(\[|\])", replacement = "") %>% # Remove square brackets
str_split(r"(\r\n\r\n)") %>% # Use str_split to break up the line breaks. On unix you can use \\n\\n
unlist() %>% # You need to unlist the result to work with easier
str_replace_all(pattern = ..., replacement = "") # remove the ">" from the start of each entry
str_split_fixed()

Now we want to break the header away from the actual amino acid sequence in each entry. To accomplish this we'll use str_split_fixed(string, pattern, n) where:

- string is the character vector that we want to split.
- pattern is the string pattern that we want to use to split each entry.
- n is the number of pieces to return for each entry. Our pattern may occur multiple times, but the call will always return a character matrix with exactly n columns.

At this point we'll switch back to using the full fasta string and save the resulting output into a new variable called header_seq_fasta.
# Break each entry into 2 columns with str_split_fixed and save it
header_seq_fasta <- fasta %>%
str_replace_all(pattern = r"(\[|\])", replacement = "") %>% # Remove square brackets
str_split(r"(\r\n\r\n)") %>% # Use str_split to break up the line breaks. On unix you can use \\n\\n
unlist() %>% # You need to unlist the result to work with easier
str_replace_all(pattern = "^>", replacement = "") %>% # remove the ">" from the start of each entry
str_split_fixed(pattern = ..., n = ...) %>% # Split the string by the first occurrence of our pattern
as.data.frame()
head(header_seq_fasta)
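A minimal sketch of how n controls the split, on toy strings rather than the fasta data:

```r
library(stringr)

# Toy example: split each entry at the FIRST comma only (n = 2);
# any later commas stay inside the second column
notes <- c("id1,MKV,extra", "id2,GGA")
str_split_fixed(notes, ",", n = 2)
#      [,1]  [,2]
# [1,] "id1" "MKV,extra"
# [2,] "id2" "GGA"
```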
Removing duplicated() entries with the help of which()

Okay, a quick side trip to remind/show you some of the tricks you can use in your data wrangling. Are there duplicated entries in our original fasta file? Perhaps the entry headers are unique but the sequences are duplicated. Let's cull them from the set for now.
How do we check which entries are duplicates of sequences already present in fasta?
# Which entries within our dataset are duplicated by sequence?
which(...)
# Use which to filter out duplicated copies
header_seq_subset <- header_seq_fasta[...]
# Take a look at the resulting subset
str(header_seq_subset)
Removing duplicated() entries with the help of filter()

Rather than use the more complicated code above, we can rely on dplyr to help us out with a call to filter() instead. In this case, recall that we want to keep the non-duplicated entries. We can accomplish this with the logical NOT (!) operator.
Again we'll save the final filtered result into the variable header_seq_subset.
# Use filter to remove duplicated entries
header_seq_subset <-
header_seq_fasta %>% # Pass the header/seq data to filter
filter(...) # identify the duplicated entries and filter on the negated result
str(header_seq_subset)
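To see both approaches side by side on a toy data frame (illustrative values, not the FoxP2 data):

```r
library(dplyr)

# Toy example: keep only the first occurrence of each sequence
df <- data.frame(id  = c("a", "b", "c"),
                 seq = c("MKV", "GGA", "MKV"))
duplicated(df$seq)               # FALSE FALSE TRUE
df[!duplicated(df$seq), ]        # base R subsetting keeps rows 1 and 2
df %>% filter(!duplicated(seq))  # dplyr equivalent
```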
Let's review. We now have a filtered data frame consisting of 2 columns. The first represents the headers from each fasta entry, and the second column contains the sequences of the fasta entry. Let's break our header into the components we wanted:
We'll accomplish this with str_match_all() which takes in our string and returns a list of our matched groups in a vectorised format. The key here is in the pattern where each group is defined by the matching pattern within each set of ( ) (parentheses). Let's keep in mind that the return output of str_match_all() will be a list of character matrices where the first column in each has the complete match, and each additional column is reserved for matching groups as defined above. If multiple matches are identified in the same header text it will produce a matrix with additional rows. Each header entry will result in a new list entry within the str_match_all() output.
In this case we are also going to take advantage of greedy matching to capture protein_info which may take any non-uniform format!
XP_007980836.1 PREDICTED: forkhead box protein P2 isoform X2 Chlorocebus sabaeus
Let's break this header down into some subcomponents which are all separated by whitespace
| Text Entry | Properties |
|---|---|
| XP_007980836.1 | alphanumeric series, separated by ".", ending with digits |
| PREDICTED: forkhead box protein P2 isoform X2 | alphanumeric with any number of words and spaces present |
| Chlorocebus | single alphabetical word |
| sabaeus | single alphabetical word |
fasta_header <-
header_seq_subset[,1] %>% # Pass along the data frame
# take just the first column (using select will return a data frame, we want a vector)
str_match_all(...)
# Take a look at the last few rows of our result
tail(fasta_header, 3)
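To make the group-capturing behaviour concrete, here is a minimal sketch on a toy string (the pattern and text below are illustrative, simpler than the real header pattern):

```r
library(stringr)

# Toy example: each ( ) group becomes its own column in the result;
# column 1 is always the complete match
m <- str_match_all("gene1 Homo sapiens", "(\\w+) (\\w+) (\\w+)")
m[[1]]
#      [,1]                 [,2]    [,3]   [,4]
# [1,] "gene1 Homo sapiens" "gene1" "Homo" "sapiens"
```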
We now have a list of 1x5 matrices where we only care about the information in columns 2-5 of each matrix. How do we pull that information from each entry of the list? Currently fasta_header exists as a list where each element is a matrix. Let's convince ourselves in the following code cell.
# Look at the first entry of fasta_header
fasta_header[[1]]
# Two ways to drop the first column
print("Select our columns specifically...")
fasta_header[[1]][...]
print("Leave a column behind")
fasta_header[[1]][...]
How do we approach converting this to something like a data frame? There are a number of ways to do this and we'll show you a couple.
# Convert the results of str_match_all into a data frame
...(fasta_header[[1]][,2:5])
# You need to transpose the trimmed data for proper coercion
data.frame(...(fasta_header[[1]][, 2:5]))
rbind()

There are a number of ways to add a row to a data frame. What must always be remembered is that to add a row, two criteria must be met:

1. The new row must have the same number of columns as the data frame.
2. The column names must match those of the data frame.

Otherwise, the program may not stop with an error, but it will certainly not give you the output you were expecting.

A great way to add rows is with the rbind() command. Behind the scenes, when we call this command, the interpreter evaluates the inputs for the call and decides which implementation of rbind() to use (is this for a data frame? a matrix? etc.). In our case, we will explicitly call rbind.data.frame() to ensure we get a data.frame as a result.
str(fasta_header[[2]])
# What happens if we call rbind?
...(fasta_header[[1]][,-1], fasta_header[[2]][,-1])
# What about rbind.data.frame?
...(fasta_header[[1]][,-1], fasta_header[[2]][,-1])
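A minimal, self-contained sketch of the two criteria in action, using toy data frames rather than the fasta data:

```r
# Toy example: rbind.data.frame() stacks rows when column
# counts and names match
df1 <- data.frame(genus = "Pan",  species = "troglodytes")
df2 <- data.frame(genus = "Homo", species = "sapiens")
rbind.data.frame(df1, df2)  # a 2-row data frame with the same 2 columns
```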
We want to repeat the above command for every entry within fasta_header. We haven't discussed loops yet, but will delve into these control structures in a future lecture. Here, however, we want to quickly use it to accomplish a repetitive action. Remember our rules for adding rows to a data frame!
# Create a data frame and place-hold the column names to match what we have in fasta_header
header.df <- rbind.data.frame(fasta_header[[1]][,-1])
rm(i) ## remove any previous instances of i
for (i in ...){
header.df <- rbind.data.frame(header.df, fasta_header[[...]][, -1])
}
head(header.df)
str(header.df)
do.call() to save yourself some trouble

The do.call() function "constructs and executes a function call from a name or a function and a list of arguments to be passed to it." It takes the form of:

do.call(what, args, quote = FALSE, envir = parent.frame())

where:

- what is a function, or the name of a function, to call.
- args is a list of arguments to pass to that function.

How does this help us? Although not exactly the same, it works very similarly to the lapply() function, where we can provide a list and it will vectorize a function over that list. The distinction is under the hood: lapply() applies the same function separately to each element of a list, while do.call() supplies the whole list of arguments just one time to a single function call. Furthermore, lapply() always returns a list, whereas do.call() returns an object based on the function called.
Overall, that means we can do this...
# Let's see what happens if we use lapply first
...(fasta_header, rbind.data.frame) %>% head()
# Now use do.call
header.df <- do.call(rbind.data.frame, fasta_header) %>%
select(-1) # remember you need to get rid of the first column
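The lapply()/do.call() distinction is easiest to see on a toy list of one-row data frames (illustrative data only):

```r
# Toy example: do.call() passes the whole list of arguments to ONE call
pieces <- list(data.frame(x = 1), data.frame(x = 2), data.frame(x = 3))
do.call(rbind.data.frame, pieces)  # one 3-row data frame
# lapply() instead applies the function to each element separately
lapply(pieces, nrow)               # a list: 1, 1, 1
```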
Don't forget to rename your columns!
# rename the columns "Accession_number", "Protein_info", "Genus", and "Species"
... <- c("Accession_number", "Protein_info", "Genus", "Species")
head(header.df,2)
str_*() methods

Our sequence entries in header_seq_subset are full of line breaks in the form of \r\n, and we don't want these anymore. So how can we go about removing those easily?
There are a number of approaches we could take but let's use the tools we've already encountered:
- str_split()
- str_flatten()

# Confirm what our sequence data looks like
print(header_seq_subset[...])
# Let's split off all of the extra line breaks
header_seq_subset[,2] %>%
str_split(...) %>%
# Now you'll have a list of split sequences that you'll need to re-unite
lapply(., ...) %>%
# Turn this result back into a vector
unlist() %>%
# drop the last element because it's not helpful
head(-1)
# Let's remove all of the extra line breaks
seq_united <- header_seq_subset[,2] %>%
...(r"(\r\n)") %>%
# Now you'll just have a vector that you need to trim
head(-1)
# Take a peek at the result
print(head(seq_united))
It's time for the last step. We want to add our final sequence information to header.df, and there are a number of ways to do this based on what we've learned. Remember, we also want the new column named "Sequence".

- cbind() can add a column to our data frame
- $ can create a new variable/column directly in header.df
- a dplyr verb can create a new column

Let's try some of them!
# we can combine the columns, but will we get the column names we want?
# Remember we already have the sequences in header.df
head(cbind(header.df[, -5], ...))
# If we use dplyr and %>% we can keep passing this around for further use
header.df %>%
... %>%
head()
# Replace the sequence column of our header.df but this is a permanent command!
header.df...
head(header.df)
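As a recap on toy data (the values and the mutate() call below are just illustrative, not the exercise answer), the three approaches look like this:

```r
library(dplyr)

# Toy example: three ways to add a column to a data frame
df <- data.frame(genus = c("Pan", "Homo"))
cbind(df, species = c("troglodytes", "sapiens"))  # base R, naming the new column
df$species <- c("troglodytes", "sapiens")         # direct assignment (permanent)
df %>% mutate(species_upper = toupper(species))   # dplyr verb, pipe-friendly
```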
We're done! Just like we wanted it:
| Accession_number | Protein_info | Genus | Species | Sequence |
|---|---|---|---|---|
| NP_001009020.1 | forkhead box protein P2 | Pan | troglodytes | MMQESATETISNSSMNQNGMST... |
And now you've started your RegEx journey!
That's the end for our fifth class on R! You've made it through Regular Expressions and we've learned about the following:
- the stringr package and its RegEx-based functions

Soon after the end of this lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete all chapters from the String manipulation in R with stringr course (5,150 points total). This is a pass-fail assignment, and in order to pass you need to achieve at least 3,863 points (75%) of the total possible points. Note that when you take hints from a DataCamp chapter, it will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the entire course summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course and you'll want to expand each section. It should look something like this:
You may need to take several screenshots if you cannot capture it all in a single try. Submit the file(s) or a combined PDF to the homework assignment section of Quercus. By submitting your scores for each section and chapter, we can keep track of your progress, identify knowledge gaps, and give you a standardized way to check on your assignment "grades" throughout the course.
You will have until 13:59 hours on Thursday, October 21st to submit your assignment (right before the next lecture).
Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared in Jupyter Notebook by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.
Your DataCamp academic subscription grants you free access to DataCamp's catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.
http://stat545.com/block022_regular-expression.html
http://stat545.com/block027_regular-expressions.html
http://stat545.com/block028_character-data.html
http://r4ds.had.co.nz/strings.html
http://www.gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf
http://www.opiniomics.org/biologists-this-is-why-bioinformaticians-hate-you/
https://figshare.com/articles/Wellcome_Trust_APC_spend_2012_13_data_file/963054
http://emailregex.com/
https://regex101.com/
https://regexr.com/
https://www.regular-expressions.info/backref.html
https://www.regular-expressions.info/unicode.html
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
https://simongoring.shinyapps.io/RegularExpressionR/
I looked for a messy dataset for data cleaning and found it in a blog titled:
"Biologists: this is why bioinformaticians hate you..."
Challenge:
This is Wellcome Trust APC dataset on the costs of open access publishing by providing article processing charge (APC) data.
https://figshare.com/articles/Wellcome_Trust_APC_spend_2012_13_data_file/963054
The main and common issue with this dataset is that data entry was done without a structured vocabulary: people could type whatever they wanted into free-text answer boxes instead of being constrained by dropdown menus with limited options, errors raised for incorrect formatting, or stipulated rules (i.e. must be all lowercase, all uppercase, no numbers, no spacing, etc.).
What I want to know is:
The route I suggest taking to answer these questions is:
There is a README file to go with this spreadsheet if you have questions about the data fields.
The blogger's opinion of cleaning this dataset:
'I now have no hair left; I’ve torn it all out. My teeth are just stumps from excessive gnashing. My faith in humanity has been destroyed!'
Don't get to this point. The dataset doesn't need to be perfect. No datasets are 100% clean. Just do what you gotta do to answer these questions.
We can talk about how this went at the beginning of next week's lecture.
And we are done for the day! Well done!
There are some other things you can do with special codes and characters. Here are some other uses for the writeLines() and cat() functions.
writeLines("hello\n\U0001F30D") # U1F30D prints a planet icon on datacamp but not here
# similar to cat. Unicode characters in: http://www.unicode.org/charts/
cat("hello\n\U0001F30D")
# Use writeLines() with "\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e"
# Hello world in Hindi (taken from datacamp.com)
writeLines("\U00000928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e")
cat("\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e")